A client asks you to build a custom credit score for a new population. An analytical sample containing 1,000 records was extracted from the database. The sample file is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data
Attribute 1: (qualitative) Status of existing checking account A11 : ... < 0 DM A12 : 0 <= ... < 200 DM A13 : ... >= 200 DM / salary assignments for at least 1 year A14 : no checking account
Attribute 2: (numerical) Duration in month
Attribute 3: (qualitative) Credit history A30 : no credits taken/ all credits paid back duly A31 : all credits at this bank paid back duly A32 : existing credits paid back duly till now A33 : delay in paying off in the past A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose A40 : car (new) A41 : car (used) A42 : furniture/equipment A43 : radio/television A44 : domestic appliances A45 : repairs A46 : education A47 : (vacation - does not exist?) A48 : retraining A49 : business A410 : others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds A61 : ... < 100 DM A62 : 100 <= ... < 500 DM A63 : 500 <= ... < 1000 DM A64 : .. >= 1000 DM A65 : unknown/ no savings account
Attribute 7: (qualitative) Present employment since A71 : unemployed A72 : ... < 1 year A73 : 1 <= ... < 4 years A74 : 4 <= ... < 7 years A75 : .. >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex A91 : male : divorced/separated A92 : female : divorced/separated/married A93 : male : single A94 : male : married/widowed A95 : female : single
Attribute 10: (qualitative) Other debtors / guarantors A101 : none A102 : co-applicant A103 : guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property A121 : real estate A122 : if not A121 : building society savings agreement/ life insurance A123 : if not A121/A122 : car or other, not in attribute 6 A124 : unknown / no property
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans A141 : bank A142 : stores A143 : none
Attribute 15: (qualitative) Housing A151 : rent A152 : own A153 : for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job A171 : unemployed/ unskilled - non-resident A172 : unskilled - resident A173 : skilled employee / official A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone A191 : none A192 : yes, registered under the customer's name
Attribute 20: (qualitative) foreign worker A201 : yes A202 : no
Attribute 21: (numerical) response variable 1: good 2: bad. The binary attribute "response variable" is the target of the problem: class 2 ("bad") represents delinquent clients (bad payers) and class 1 ("good") represents clients who pay their bills on time (good payers).
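Per the UCI documentation for this dataset, the labels are coded 1 = good payer and 2 = bad payer. scikit-learn handles the {1, 2} labels directly, but a common convention is to recode the target as a 0/1 "default" flag. A minimal sketch (toy labels, not the real column):

```python
import pandas as pd

# Hypothetical illustration: recode the {1, 2} Statlog labels into a 0/1 flag,
# where 1 means "bad payer" (the event of interest for a credit score).
y_raw = pd.Series([1, 2, 1, 1, 2])
y_bad = (y_raw == 2).astype(int)
print(y_bad.tolist())  # [0, 1, 0, 0, 1]
```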
# Importing the libraries
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport  # renamed to ydata-profiling in newer releases
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Loading the dataset and naming the columns
# Note: the raw file has no header row; without header=None the first record is
# consumed as column names, which is why the outputs below show 999 rows instead of 1000.
dados = pd.read_table('german.txt', sep=' ', header=None,
                      names=['checking_account', 'duration_month', 'credit_history', 'purpose',
                             'credit_amount', 'savings_account', 'employed_since', 'installment_rate',
                             'status_sex', 'other_debtors', 'residence_since', 'property', 'age',
                             'other_installment', 'housing', 'number_existing_credits', 'job',
                             'number_people_provide_maintenance', 'telephone', 'foreign_worker',
                             'response_variable'])
display(dados)
| | checking_account | duration_month | credit_history | purpose | credit_amount | savings_account | employed_since | installment_rate | status_sex | other_debtors | ... | property | age | other_installment | housing | number_existing_credits | job | number_people_provide_maintenance | telephone | foreign_worker | response_variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 1 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 2 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 3 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |
| 4 | A14 | 36 | A32 | A46 | 9055 | A65 | A73 | 2 | A93 | A101 | ... | A124 | 35 | A143 | A153 | 1 | A172 | 2 | A192 | A201 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 994 | A14 | 12 | A32 | A42 | 1736 | A61 | A74 | 3 | A92 | A101 | ... | A121 | 31 | A143 | A152 | 1 | A172 | 1 | A191 | A201 | 1 |
| 995 | A11 | 30 | A32 | A41 | 3857 | A61 | A73 | 4 | A91 | A101 | ... | A122 | 40 | A143 | A152 | 1 | A174 | 1 | A192 | A201 | 1 |
| 996 | A14 | 12 | A32 | A43 | 804 | A61 | A75 | 4 | A93 | A101 | ... | A123 | 38 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 1 |
| 997 | A11 | 45 | A32 | A43 | 1845 | A61 | A73 | 4 | A93 | A101 | ... | A124 | 23 | A143 | A153 | 1 | A173 | 1 | A192 | A201 | 2 |
| 998 | A12 | 45 | A34 | A41 | 4576 | A62 | A71 | 3 | A93 | A101 | ... | A123 | 27 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 1 |
999 rows × 21 columns
# Inspecting each column's data type
dados.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   checking_account                   999 non-null    object
 1   duration_month                     999 non-null    int64
 2   credit_history                     999 non-null    object
 3   purpose                            999 non-null    object
 4   credit_amount                      999 non-null    int64
 5   savings_account                    999 non-null    object
 6   employed_since                     999 non-null    object
 7   installment_rate                   999 non-null    int64
 8   status_sex                         999 non-null    object
 9   other_debtors                      999 non-null    object
 10  residence_since                    999 non-null    int64
 11  property                           999 non-null    object
 12  age                                999 non-null    int64
 13  other_installment                  999 non-null    object
 14  housing                            999 non-null    object
 15  number_existing_credits            999 non-null    int64
 16  job                                999 non-null    object
 17  number_people_provide_maintenance  999 non-null    int64
 18  telephone                          999 non-null    object
 19  foreign_worker                     999 non-null    object
 20  response_variable                  999 non-null    int64
dtypes: int64(8), object(13)
memory usage: 164.0+ KB
# Exploring the data with pandas profiling
profile = ProfileReport(dados)
profile.to_file('analise_exploratoria.html')
profile
With pandas profiling we can quickly get a visual overview of the distribution and variability of the data.
Some highlights from the dataset:
The duration in months has a mean of 20.9, ranging from 4 to 72.
For credit history, most records fall in category A32 (existing credits paid back duly till now), followed by A34 (critical account / credits at other institutions).
The purposes are fairly evenly distributed; the most frequent are radio/television (A43), new car (A40) and furniture/equipment (A42).
The mean credit amount is 3,273, ranging from 250 to 18,424.
For savings (savings_account), the large majority of records belong to category A61 (< 100 DM).
Single males (A93) form the most common personal-status category in the dataset.
Age ranges from 19 to 75 years, with a mean of 35.5.
Most clients live in a home they own (A152).
Most clients are skilled employees (A173).
Almost all clients are foreign workers (A201).
Roughly 70% of the people in the sample are good payers (class 1) and about 30% are bad payers (class 2).
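The figures above can be checked directly with pandas instead of the profiling report. A sketch on a toy stand-in frame (in the notebook, the full `dados` frame would be used):

```python
import pandas as pd

# Toy stand-in frame; in the notebook the full `dados` frame would be used instead.
dados = pd.DataFrame({
    'duration_month': [4, 12, 24, 72],
    'credit_history': ['A32', 'A32', 'A34', 'A30'],
})
# Numeric summaries (mean/min/max) back the duration figures quoted above
resumo = dados['duration_month'].describe()
print(resumo[['mean', 'min', 'max']])
# Relative frequencies back the categorical observations
print(dados['credit_history'].value_counts(normalize=True))
```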
# Separating the dataframes before converting to dummies (one-hot encoding: categorical
# values become indicator columns, with no numeric ordering implied)
# The data is reloaded here so this cell can be re-run from scratch; otherwise a second
# run would try to drop the already-removed 'response_variable' column.
dados = pd.read_table('german.txt', sep=' ', header=None,
                      names=['checking_account', 'duration_month', 'credit_history', 'purpose',
                             'credit_amount', 'savings_account', 'employed_since', 'installment_rate',
                             'status_sex', 'other_debtors', 'residence_since', 'property', 'age',
                             'other_installment', 'housing', 'number_existing_credits', 'job',
                             'number_people_provide_maintenance', 'telephone', 'foreign_worker',
                             'response_variable'])
dados_y = dados['response_variable']
dados_x = dados.drop('response_variable', axis=1)
print(dados_x)
checking_account duration_month credit_history purpose credit_amount \
0 A12 48 A32 A43 5951
1 A14 12 A34 A46 2096
2 A11 42 A32 A42 7882
3 A11 24 A33 A40 4870
4 A14 36 A32 A46 9055
.. ... ... ... ... ...
994 A14 12 A32 A42 1736
995 A11 30 A32 A41 3857
996 A14 12 A32 A43 804
997 A11 45 A32 A43 1845
998 A12 45 A34 A41 4576
savings_account employed_since installment_rate status_sex other_debtors \
0 A61 A73 2 A92 A101
1 A61 A74 2 A93 A101
2 A61 A74 2 A93 A103
3 A61 A73 3 A93 A101
4 A65 A73 2 A93 A101
.. ... ... ... ... ...
994 A61 A74 3 A92 A101
995 A61 A73 4 A91 A101
996 A61 A75 4 A93 A101
997 A61 A73 4 A93 A101
998 A62 A71 3 A93 A101
residence_since property age other_installment housing \
0 2 A121 22 A143 A152
1 3 A121 49 A143 A152
2 4 A122 45 A143 A153
3 4 A124 53 A143 A153
4 4 A124 35 A143 A153
.. ... ... ... ... ...
994 4 A121 31 A143 A152
995 4 A122 40 A143 A152
996 4 A123 38 A143 A152
997 4 A124 23 A143 A153
998 4 A123 27 A143 A152
number_existing_credits job number_people_provide_maintenance \
0 1 A173 1
1 1 A172 2
2 1 A173 2
3 2 A173 2
4 1 A172 2
.. ... ... ...
994 1 A172 1
995 1 A174 1
996 1 A173 1
997 1 A173 1
998 1 A173 1
telephone foreign_worker
0 A191 A201
1 A191 A201
2 A191 A201
3 A191 A201
4 A192 A201
.. ... ...
994 A191 A201
995 A192 A201
996 A191 A201
997 A192 A201
998 A191 A201
[999 rows x 20 columns]
# Converting the categorical values to numbers for the regression model
# This conversion also lets us inspect the dummy variables with pandas profiling
dados_dummies = pd.get_dummies(dados)
profile = ProfileReport(dados_dummies)
profile
# Converting string values to numeric so they can be used in the regression model
dummies = pd.get_dummies(dados_x)
print(dummies.columns)
print(dummies)
Index(['duration_month', 'credit_amount', 'installment_rate',
'residence_since', 'age', 'number_existing_credits',
'number_people_provide_maintenance', 'checking_account_A11',
'checking_account_A12', 'checking_account_A13', 'checking_account_A14',
'credit_history_A30', 'credit_history_A31', 'credit_history_A32',
'credit_history_A33', 'credit_history_A34', 'purpose_A40',
'purpose_A41', 'purpose_A410', 'purpose_A42', 'purpose_A43',
'purpose_A44', 'purpose_A45', 'purpose_A46', 'purpose_A48',
'purpose_A49', 'savings_account_A61', 'savings_account_A62',
'savings_account_A63', 'savings_account_A64', 'savings_account_A65',
'employed_since_A71', 'employed_since_A72', 'employed_since_A73',
'employed_since_A74', 'employed_since_A75', 'status_sex_A91',
'status_sex_A92', 'status_sex_A93', 'status_sex_A94',
'other_debtors_A101', 'other_debtors_A102', 'other_debtors_A103',
'property_A121', 'property_A122', 'property_A123', 'property_A124',
'other_installment_A141', 'other_installment_A142',
'other_installment_A143', 'housing_A151', 'housing_A152',
'housing_A153', 'job_A171', 'job_A172', 'job_A173', 'job_A174',
'telephone_A191', 'telephone_A192', 'foreign_worker_A201',
'foreign_worker_A202'],
dtype='object')
duration_month credit_amount installment_rate residence_since age \
0 48 5951 2 2 22
1 12 2096 2 3 49
2 42 7882 2 4 45
3 24 4870 3 4 53
4 36 9055 2 4 35
.. ... ... ... ... ...
994 12 1736 3 4 31
995 30 3857 4 4 40
996 12 804 4 4 38
997 45 1845 4 4 23
998 45 4576 3 4 27
number_existing_credits number_people_provide_maintenance \
0 1 1
1 1 2
2 1 2
3 2 2
4 1 2
.. ... ...
994 1 1
995 1 1
996 1 1
997 1 1
998 1 1
checking_account_A11 checking_account_A12 checking_account_A13 ... \
0 0 1 0 ...
1 0 0 0 ...
2 1 0 0 ...
3 1 0 0 ...
4 0 0 0 ...
.. ... ... ... ...
994 0 0 0 ...
995 1 0 0 ...
996 0 0 0 ...
997 1 0 0 ...
998 0 1 0 ...
housing_A152 housing_A153 job_A171 job_A172 job_A173 job_A174 \
0 1 0 0 0 1 0
1 1 0 0 1 0 0
2 0 1 0 0 1 0
3 0 1 0 0 1 0
4 0 1 0 1 0 0
.. ... ... ... ... ... ...
994 1 0 0 1 0 0
995 1 0 0 0 0 1
996 1 0 0 0 1 0
997 0 1 0 0 1 0
998 1 0 0 0 1 0
telephone_A191 telephone_A192 foreign_worker_A201 foreign_worker_A202
0 1 0 1 0
1 1 0 1 0
2 1 0 1 0
3 1 0 1 0
4 0 1 1 0
.. ... ... ... ...
994 1 0 1 0
995 0 1 1 0
996 1 0 1 0
997 0 1 1 0
998 1 0 1 0
[999 rows x 61 columns]
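One caveat about the 61 dummy columns: since `get_dummies` keeps every level, the indicators of each categorical sum to 1 and the columns are collinear (the "dummy variable trap"). Regularized logistic regression tolerates this, but `drop_first=True` is a common alternative. A sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Toy frame with one categorical column, illustrating drop_first
df = pd.DataFrame({'housing': ['A151', 'A152', 'A153', 'A152']})
full = pd.get_dummies(df)                      # 3 indicator columns, mutually redundant
reduced = pd.get_dummies(df, drop_first=True)  # 2 columns; A151 becomes the baseline
print(list(full.columns))     # ['housing_A151', 'housing_A152', 'housing_A153']
print(list(reduced.columns))  # ['housing_A152', 'housing_A153']
```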
# Loading the data into NumPy arrays
X = dummies.to_numpy()
y = dados_y.to_numpy()
print(X)
print(X.shape)
print(y)
print(y.shape)
[[ 48 5951    2 ...    0    1    0]
 [ 12 2096    2 ...    0    1    0]
 [ 42 7882    2 ...    0    1    0]
 ...
 [ 12  804    4 ...    0    1    0]
 [ 45 1845    4 ...    1    1    0]
 [ 45 4576    3 ...    0    1    0]]
(999, 61)
[2 1 1 2 1 1 1 1 2 2 ... 1 1 1 1 2 1]
(999,)
# Normalizing and standardizing the data
# MinMaxScaler rescales each feature to [0, 1] (normalization); StandardScaler centers
# each feature and scales it to unit variance (standardization)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Normalizing
scaler = MinMaxScaler()
scaler.fit(X)
normalized_data = scaler.transform(X)
print(normalized_data)
# Standardizing
# Note: fitting the scaler on the full dataset before the train/test split leaks
# test-set statistics into training; fitting on X_train only (e.g. inside a Pipeline)
# is the safer practice.
scaler = StandardScaler()
scaler.fit(X)
standardized_data = scaler.transform(X)
print(standardized_data)
print(standardized_data.shape)
X = standardized_data
[[0.64705882 0.31368989 0.33333333 ... 0.         1.         0.        ]
 [0.11764706 0.10157368 0.33333333 ... 0.         1.         0.        ]
 [0.55882353 0.41994057 0.33333333 ... 0.         1.         0.        ]
 ...
 [0.11764706 0.03048311 1.         ... 0.         1.         0.        ]
 [0.60294118 0.08776274 1.         ... 1.         1.         0.        ]
 [0.60294118 0.23803235 0.66666667 ... 0.         1.         0.        ]]
[[ 2.24755338  0.94885997 -0.86919627 ... -0.8222983   0.19611614 -0.19611614]
 [-0.74010176 -0.41721553 -0.86919627 ... -0.8222983   0.19611614 -0.19611614]
 [ 1.74961086  1.633138   -0.86919627 ... -0.8222983   0.19611614 -0.19611614]
 ...
 [-0.74010176 -0.87505459  0.91932499 ... -0.8222983   0.19611614 -0.19611614]
 [ 1.99858212 -0.50616105  0.91932499 ...  1.21610369  0.19611614 -0.19611614]
 [ 1.99858212  0.46160866  0.02506436 ... -0.8222983   0.19611614 -0.19611614]]
(999, 61)
There are several ways to normalize and standardize numeric variables, each with its own trade-offs.
Normalization rescales a numeric variable to a specific range, usually [0, 1]. This is done by subtracting the variable's minimum from each value and dividing by the range (maximum minus minimum):
(x - min(x)) / (max(x) - min(x))
Standardization transforms a numeric variable so that it has mean 0 and standard deviation 1:
(x - mean(x)) / std(x)
Normalization is appropriate when the data are known to lie within a specific range, whereas standardization is more appropriate when the distribution of the data is unknown. Both prevent features on large scales from dominating features on small scales. In some settings, such as neural networks, inputs must be scaled before training because some activation functions only behave well when inputs fall within a certain range.
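The two formulas above can be checked against scikit-learn's scalers on a toy column (a sketch; any small array works):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[4.0], [12.0], [24.0], [72.0]])  # toy "duration" column

# Manual versions of the two formulas from the text
manual_norm = (x - x.min()) / (x.max() - x.min())
manual_std = (x - x.mean()) / x.std()  # population std, matching StandardScaler

# Both match the library implementations
assert np.allclose(MinMaxScaler().fit_transform(x), manual_norm)
assert np.allclose(StandardScaler().fit_transform(x), manual_std)
print(manual_norm.ravel())
```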
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=45)
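As noted above, the scaler was fitted on all of X before the split. A leak-free alternative is to scale inside a Pipeline, so the scaler only ever sees the training fold. A sketch with toy data standing in for (X, y):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for (X, y); the pipeline fits the scaler on the training
# split only, so no test-set statistics leak into training.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = rng.integers(1, 3, size=100)  # labels coded 1/2 like the dataset
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=45)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```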
# Creating the model
model = LogisticRegression(random_state=0, max_iter=1000)
# Training the model
model.fit(X_train, y_train)
# A second, equivalent model fitted in one line (random_state has no effect on the
# default lbfgs solver, so model and clf2 learn the same coefficients)
clf2 = LogisticRegression(random_state=45, max_iter=1000).fit(X_train, y_train)
# Predicting the classes
y_pred2 = clf2.predict(X_test)
# Evaluating the errors with a confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred2)
# Rows are true classes (1, 2) and columns are predicted classes. Taking class 2
# (bad payer) as the positive class: 185 true negatives, 19 false positives,
# 54 false negatives and 42 true positives
array([[185, 19],
[ 54, 42]], dtype=int64)
# Evaluating the model
# score = model.score(X_test, y_test)  # equivalent shortcut
from sklearn import metrics
score = metrics.accuracy_score(y_test, y_pred2)
print('Acurácia:', score)
# Fraction of correct predictions
Acurácia: 0.7566666666666667
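With a roughly 70/30 class split, accuracy alone can be misleading (always predicting "good" would already score ~0.68). Per-class precision and recall can be derived from the four counts in the confusion matrix printed above:

```python
import numpy as np

# Counts taken from the confusion matrix printed above ([[185, 19], [54, 42]])
cm = np.array([[185, 19],
               [54, 42]])
# Treating class 2 (bad payer, second row/column) as the positive class:
tp, fn = cm[1, 1], cm[1, 0]
fp, tn = cm[0, 1], cm[0, 0]
precision = tp / (tp + fp)   # 42 / 61 ≈ 0.689: of predicted bad payers, how many are bad
recall = tp / (tp + fn)      # 42 / 96 ≈ 0.438: of actual bad payers, how many we catch
print(precision, recall)
```

The low recall on the bad class is the expensive error in credit scoring: more than half of the bad payers are approved as good.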
# Using the model to predict on the test set
predictions = model.predict(X_test)
print(predictions)
[1 1 2 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 1 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 2 2 1 2 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
# Plotting the logistic (sigmoid) curve over the range of the standardized training data
import matplotlib.pyplot as plt
from scipy.special import expit
x = np.linspace(X_train.min(), X_train.max(), 100)
y = expit(x)
plt.plot(x, y)
plt.grid()
plt.xlim(X_train.min(), X_train.max())
plt.xlabel('x')
plt.title('expit(x)')
plt.show()
# Comparing the predictions with the true values
print('True y_test values:')
print(y_test)
print('Predicted values:')
print(predictions)
True y_test values:
[2 1 1 1 2 2 1 2 1 1 ...]
Predicted values:
[1 1 2 1 1 2 1 2 1 1 ...]
# Predicting the class probabilities
proba = clf2.predict_proba(X_test)
# predict_proba columns follow clf2.classes_ ([1, 2]), so column index 1 holds the
# probability of class 2, i.e. of being a bad payer
probabilidade_mau_pagador = proba[:, 1]
list(probabilidade_mau_pagador)
[0.3575544075252283,
 0.24646200250250866,
 0.5037693260003645,
 0.02497158281026338,
 0.4082282634557703,
 ...
 0.019613963695723234]
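To present these probabilities as a credit score on a familiar scale, a common industry convention (not part of the notebook; an illustration) maps log-odds to points. The base score, base odds, and points-to-double-odds (PDO) below are arbitrary example choices:

```python
import numpy as np

# Illustrative PDO scaling -- the base score (600), base odds (1:1) and PDO (20)
# are arbitrary example values, not anything estimated from the notebook.
def prob_to_score(p_bad, base_score=600.0, base_odds=1.0, pdo=20.0):
    """Map probability of being a bad payer to a score: higher score = lower risk."""
    factor = pdo / np.log(2)                       # points per doubling of the odds
    offset = base_score - factor * np.log(base_odds)
    odds_good = (1.0 - p_bad) / p_bad              # odds of being a good payer
    return offset + factor * np.log(odds_good)

print(round(prob_to_score(0.5), 1))   # odds 1:1 -> the base score, 600.0
print(round(prob_to_score(0.25), 1))  # odds 3:1 -> 600 + 20*log2(3) ≈ 631.7
```

Applied to `probabilidade_mau_pagador`, this would turn each test-set probability into a score where every 20 points doubles the odds of being a good payer.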
The fitted logistic curve for a single feature is expit(w1*x + w0).
# Inspecting the coefficients
clf2.coef_
# and the intercept
clf2.intercept_
# Verificando o coeficiente angular
w1 = clf2.coef_[0][0]
print('O coeficiente angular é', w1)
# E o coeficiente linear
w0 = clf2.intercept_[0]
print('O coeficiente linear é', w0)
Slope (w1): 0.24501134423048582
Intercept (w0): -1.2413081399442563
SciPy's expit is the logistic sigmoid: expit(x) = 1/(1+exp(-x))
# Importing expit
from scipy.special import expit
# Evaluating the curve for each value of x
valores_x = np.linspace(X_train.min(), X_train.max(), 100)
valores_y = expit(w1*valores_x + w0)
# We can also evaluate this curve on the training data itself. Note that this applies
# the single-feature curve element-wise to all 61 standardized columns; it is not the
# full-model probability expit(X @ w + w0).
y_curva = expit(w1*X_train + w0)
print(y_curva)
print(y_curva.shape)
[[0.19424782 0.19075467 0.15797471 ... 0.19111513 0.23267685 0.21596136]
 [0.39016257 0.27297767 0.26579497 ... 0.19111513 0.23267685 0.21596136]
 [0.23529712 0.25273044 0.18934511 ... 0.28022437 0.23267685 0.21596136]
 ...
 [0.21405734 0.20386926 0.26579497 ... 0.28022437 0.23267685 0.21596136]
 [0.28198528 0.26536054 0.18934511 ... 0.28022437 0.23267685 0.21596136]
 [0.23529712 0.19435886 0.15797471 ... 0.19111513 0.23267685 0.21596136]]
(699, 61)
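For the full model, the predicted probability of class 2 uses the complete linear combination of all features, not just w1. A sketch on toy data (since the notebook's X_train is not reproduced here) confirming that expit(X @ coef + intercept) matches predict_proba:

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))     # toy standardized features
y = rng.integers(1, 3, size=200)  # labels coded 1/2 like the dataset

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Probability of the second class (here, 2) recomputed from the fitted weights:
manual = expit(X @ clf.coef_[0] + clf.intercept_[0])
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True
```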